Comprehensive Regression Analysis & Modeling - "Predicting Obesity Levels"¶
Developed by (Name: Sri Sai Kowshik Reddy Boyalla - UIN: 734003625)
1. Introduction¶
Overview of the project:¶
The purpose of this project is to perform a comprehensive regression analysis to assess obesity levels in individuals as a function of various demographic, dietary, and lifestyle parameters. By analyzing the relationships between predictors and the target variable (obesity levels), this project aims to build reliable regression models to estimate obesity levels, providing valuable insights into public health and nutrition management.
Objectives:¶
Understanding relationships: Examine how demographic, dietary, and lifestyle factors, such as age, gender, family history of overweight, eating habits, physical activity, and others, influence obesity levels.
Developing predictive models: Create and evaluate regression models to accurately predict obesity levels across categories such as normal weight, overweight, and obesity types.
Providing actionable insights: Identify key factors contributing to obesity to inform public health interventions and personalized recommendations for improved health outcomes.
Research Questions:¶
- What are the most significant predictors of obesity levels in individuals based on demographic, dietary, and lifestyle factors?
- How do different regression models (e.g., linear regression, polynomial regression, and regularized regression) perform in predicting obesity levels?
- Can the results inform strategies for public health initiatives and personalized recommendations to manage and prevent obesity?
2. Dataset Selection and Description¶
Source of the Dataset:¶
The dataset used in this project is the Obesity Levels Dataset, sourced from the UC Irvine Machine Learning Repository, which contains information collected from individuals in Mexico, Peru, and Colombia. This dataset captures various demographic, dietary, and lifestyle parameters along with the obesity levels of individuals. It is a publicly available dataset containing 2,111 observations and 17 variables, with 16 independent variables (features) and 1 dependent variable (target).
Dataset Link: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
Description of Variables:¶
*1. Features:*
- Gender: Gender of the individual (categorical: Male/Female).
- Age: Age of the individual, measured in years (continuous).
- Height: Height of the individual, measured in meters (continuous).
- Weight: Weight of the individual, measured in kilograms (continuous).
- family_history_with_overweight: Indicates if a family member has a history of being overweight (binary: Yes/No).
- FAVC (Frequent Consumption of High-Caloric Food): Indicates if the individual frequently consumes high-caloric food (binary: Yes/No).
- FCVC (Frequency of Vegetable Consumption): Frequency of vegetable consumption with meals, on a 1-3 scale (continuous in the data).
- NCP (Number of Meals Per Day): Average number of main meals consumed daily (continuous).
- CAEC (Consumption of Food Between Meals): Frequency of consuming food between meals (categorical: No/Sometimes/Frequently/Always).
- SMOKE: Indicates if the individual smokes (binary: Yes/No).
- CH2O (Daily Water Intake): Average daily water consumption, measured in liters (continuous).
- SCC (Calorie Monitoring): Indicates if the individual monitors their calorie intake (binary: Yes/No).
- FAF (Physical Activity Frequency): Frequency of physical activity per week (continuous).
- TUE (Time Using Technology): Average daily time spent using technological devices, measured in hours (continuous).
- CALC (Alcohol Consumption): Frequency of alcohol consumption (categorical: No/Sometimes/Frequently/Always).
- MTRANS (Mode of Transportation): The primary mode of transportation used by the individual (categorical: Walking/Public Transportation/Automobile/Bike/Motorbike).
*2. Target:*
- NObeyesdad (Obesity Level): Obesity level of the individual, categorized as: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, Obesity Type III
These variables provide a comprehensive view of the individual’s demographic profile, dietary habits, physical activity, and lifestyle factors, enabling the analysis and modeling of obesity levels.
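The target's seven categories have a natural clinical ordering (from Insufficient Weight up to Obesity Type III) that plain alphabetical label encoding does not preserve. A minimal sketch of an explicit ordinal mapping — the `OBESITY_ORDER` list and `encode_obesity_level` helper are illustrative names, not part of the original analysis:

```python
import pandas as pd

# Illustrative explicit ordering of NObeyesdad categories, from least to most severe.
OBESITY_ORDER = [
    "Insufficient_Weight", "Normal_Weight",
    "Overweight_Level_I", "Overweight_Level_II",
    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III",
]

def encode_obesity_level(series: pd.Series) -> pd.Series:
    """Map obesity categories to integer codes 0..6 following the clinical ordering."""
    cat = pd.Categorical(series, categories=OBESITY_ORDER, ordered=True)
    return pd.Series(cat.codes, index=series.index)

sample = pd.Series(["Normal_Weight", "Obesity_Type_III", "Overweight_Level_I"])
print(encode_obesity_level(sample).tolist())  # [1, 6, 2]
```

An encoding like this keeps the numeric target monotone in severity, which matters when the categories are later treated as a continuous regression target.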
3. Data Pre-Processing¶
import pandas as pd
# Loading the dataset
file_path = 'ObesityDataSet_raw_and_data_sinthetic.csv' # Replace with your file path
data = pd.read_csv(file_path)
# Displaying the dataset information and the first few rows
# (data.info() prints its summary and returns None, so it is called directly)
data.info()
data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   object
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object
 5   FAVC                            2111 non-null   object
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object
 9   SMOKE                           2111 non-null   object
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   object
 15  MTRANS                          2111 non-null   object
 16  NObeyesdad                      2111 non-null   object
dtypes: float64(8), object(9)
memory usage: 280.5+ KB
| | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | Normal_Weight |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
| 2 | Male | 23.0 | 1.80 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
| 3 | Male | 27.0 | 1.80 | 87.0 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking | Overweight_Level_I |
| 4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |
# Checking for missing values in each column
missing_values = data.isnull().sum()
print(missing_values)
Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64
# Checking for duplicates
duplicates = data.duplicated().sum()
# Printing duplicates count
print(duplicates)
# Removing the duplicates
data = data.drop_duplicates()
24
# Displaying summary statistics and Checking for outliers in continuous variables (e.g., Age, Weight, Height)
data.describe()
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
|---|---|---|---|---|---|---|---|---|
| count | 2087.000000 | 2087.000000 | 2087.000000 | 2087.000000 | 2087.000000 | 2087.000000 | 2087.000000 | 2087.000000 |
| mean | 24.353090 | 1.702674 | 86.858730 | 2.421466 | 2.701179 | 2.004749 | 1.012812 | 0.663035 |
| std | 6.368801 | 0.093186 | 26.190847 | 0.534737 | 0.764614 | 0.608284 | 0.853475 | 0.608153 |
| min | 14.000000 | 1.450000 | 39.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 19.915937 | 1.630178 | 66.000000 | 2.000000 | 2.697467 | 1.590922 | 0.124505 | 0.000000 |
| 50% | 22.847618 | 1.701584 | 83.101100 | 2.396265 | 3.000000 | 2.000000 | 1.000000 | 0.630866 |
| 75% | 26.000000 | 1.769491 | 108.015907 | 3.000000 | 3.000000 | 2.466193 | 1.678102 | 1.000000 |
| max | 61.000000 | 1.980000 | 173.000000 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 2.000000 |
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
# Copying the dataset to ensure the original remains intact
transformed_data = data.copy()
# Label Encoding for binary and ordinal variables
label_encoder = LabelEncoder()
binary_features = ['family_history_with_overweight', 'FAVC', 'SMOKE', 'SCC']
target_variable = 'NObeyesdad'
for feature in binary_features:
transformed_data[feature] = label_encoder.fit_transform(transformed_data[feature])
# Label encoding the target variable
# Note: LabelEncoder assigns codes alphabetically (Insufficient=0, Normal=1,
# Obesity Types I-III=2-4, Overweight Levels I-II=5-6), which does not follow
# the clinical severity ordering; an explicit mapping would preserve it.
transformed_data[target_variable] = label_encoder.fit_transform(transformed_data[target_variable])
# One-Hot Encoding for nominal variables
nominal_features = ['Gender', 'CAEC', 'CALC', 'MTRANS']
transformed_data = pd.get_dummies(transformed_data, columns=nominal_features, drop_first=True)
# Scaling the continuous features using Min-Max Scaling
scaler = MinMaxScaler()
continuous_features = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
transformed_data[continuous_features] = scaler.fit_transform(transformed_data[continuous_features])
# Displaying Transformed Dataset after pre-processing
transformed_data.head()
| | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | SMOKE | CH2O | SCC | ... | CAEC_Frequently | CAEC_Sometimes | CAEC_no | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.148936 | 0.320755 | 0.186567 | 1 | 0 | 0.5 | 0.666667 | 0 | 0.5 | 0 | ... | False | True | False | False | False | True | False | False | True | False |
| 1 | 0.148936 | 0.132075 | 0.126866 | 1 | 0 | 1.0 | 0.666667 | 1 | 1.0 | 1 | ... | False | True | False | False | True | False | False | False | True | False |
| 2 | 0.191489 | 0.660377 | 0.283582 | 1 | 0 | 0.5 | 0.666667 | 0 | 0.5 | 0 | ... | False | True | False | True | False | False | False | False | True | False |
| 3 | 0.276596 | 0.660377 | 0.358209 | 0 | 0 | 1.0 | 0.666667 | 0 | 0.5 | 0 | ... | False | True | False | True | False | False | False | False | False | True |
| 4 | 0.170213 | 0.622642 | 0.379104 | 0 | 0 | 0.5 | 0.000000 | 0 | 0.5 | 0 | ... | False | True | False | False | True | False | False | False | True | False |
5 rows × 24 columns
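One caveat on the scaling step above: the Min-Max scaler is fit on the full dataset before any train/test split, so test-set minima and maxima leak into the transformation. A minimal sketch of the leakage-free alternative — fitting the scaler on the training split only — using synthetic stand-in columns so the block is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for two of the continuous columns (illustrative ranges).
rng = np.random.default_rng(42)
df = pd.DataFrame({"Age": rng.uniform(14, 61, 200),
                   "Weight": rng.uniform(39, 173, 200)})

train, test = train_test_split(df, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to the test split,
# so test-set extremes never influence the transformation parameters.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)  # may fall slightly outside [0, 1]
print(train_scaled.min(), train_scaled.max())
```

With the sample sizes here the effect on results is likely small, but the train-only fit is the pattern that generalizes to deployment.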
4. Exploratory Data Analysis (EDA)¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Univariate Analysis
# Summary statistics for numerical variables
summary_stats = transformed_data.describe().T
summary_stats['mode'] = transformed_data[continuous_features].mode().iloc[0]
summary_stats = summary_stats[['mean', '50%', 'std', 'min', 'max', 'mode']]
summary_stats.rename(columns={'50%': 'median'}, inplace=True)
# Frequency distribution for categorical variables
categorical_features = [
col for col in transformed_data.columns
if transformed_data[col].nunique() <= 10 and col not in ['NObeyesdad']
]
freq_distribution = {}
for col in categorical_features:
freq_distribution[col] = transformed_data[col].value_counts()
# Plotting Histograms and Box Plots for numerical variables
for feature in continuous_features:
# Histogram with Density
plt.figure(figsize=(10, 5))
sns.histplot(transformed_data[feature], kde=True, bins=20, color='blue', alpha=0.6)
plt.title(f"Histogram and Density Plot for {feature}")
plt.xlabel(feature)
plt.ylabel("Frequency")
plt.show()
# Box Plot
plt.figure(figsize=(10, 5))
sns.boxplot(x=transformed_data[feature], color='green')
plt.title(f"Box Plot for {feature}")
plt.xlabel(feature)
plt.show()
# Displaying summary statistics and frequency distributions
print(summary_stats)
# Frequency distributions for categorical variables
print(freq_distribution)
mean median std min max \
Age 0.220279 0.188247 0.135506 0.0 1.0
Height 0.476744 0.474687 0.175823 0.0 1.0
Weight 0.357155 0.329113 0.195454 0.0 1.0
family_history_with_overweight 0.825108 1.000000 0.379966 0.0 1.0
FAVC 0.883565 1.000000 0.320823 0.0 1.0
FCVC 0.710733 0.698133 0.267368 0.0 1.0
NCP 0.567060 0.666667 0.254871 0.0 1.0
SMOKE 0.021083 0.000000 0.143695 0.0 1.0
CH2O 0.502375 0.500000 0.304142 0.0 1.0
SCC 0.045999 0.000000 0.209533 0.0 1.0
FAF 0.337604 0.333333 0.284492 0.0 1.0
TUE 0.331518 0.315433 0.304077 0.0 1.0
NObeyesdad 3.014375 3.000000 1.948470 0.0 6.0
mode
Age 0.085106
Height 0.471698
Weight 0.305970
family_history_with_overweight NaN
FAVC NaN
FCVC 1.000000
NCP 0.666667
SMOKE NaN
CH2O 0.500000
SCC NaN
FAF 0.000000
TUE 0.000000
NObeyesdad NaN
{'family_history_with_overweight': family_history_with_overweight
1 1722
0 365
Name: count, dtype: int64, 'FAVC': FAVC
1 1844
0 243
Name: count, dtype: int64, 'SMOKE': SMOKE
0 2043
1 44
Name: count, dtype: int64, 'SCC': SCC
0 1991
1 96
Name: count, dtype: int64, 'Gender_Male': Gender_Male
True 1052
False 1035
Name: count, dtype: int64, 'CAEC_Frequently': CAEC_Frequently
False 1851
True 236
Name: count, dtype: int64, 'CAEC_Sometimes': CAEC_Sometimes
True 1761
False 326
Name: count, dtype: int64, 'CAEC_no': CAEC_no
False 2050
True 37
Name: count, dtype: int64, 'CALC_Frequently': CALC_Frequently
False 2017
True 70
Name: count, dtype: int64, 'CALC_Sometimes': CALC_Sometimes
True 1380
False 707
Name: count, dtype: int64, 'CALC_no': CALC_no
False 1451
True 636
Name: count, dtype: int64, 'MTRANS_Bike': MTRANS_Bike
False 2080
True 7
Name: count, dtype: int64, 'MTRANS_Motorbike': MTRANS_Motorbike
False 2076
True 11
Name: count, dtype: int64, 'MTRANS_Public_Transportation': MTRANS_Public_Transportation
True 1558
False 529
Name: count, dtype: int64, 'MTRANS_Walking': MTRANS_Walking
False 2032
True 55
Name: count, dtype: int64}
import matplotlib.pyplot as plt
import seaborn as sns
# Bivariate Analysis
# Set style for better visualization
sns.set(style="whitegrid")
# Scatter plots for continuous variables vs. target variable
continuous_features = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
target_variable = 'NObeyesdad'
# Creating scatter plots for continuous features against the target
for feature in continuous_features:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=transformed_data, x=feature, y=target_variable, alpha=0.7)
plt.title(f"{feature} vs. {target_variable}")
plt.xlabel(feature)
plt.ylabel("Obesity Level")
plt.show()
# Correlation Matrix for Continuous Features
plt.figure(figsize=(12, 8))
correlation_matrix = transformed_data[continuous_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix for Continuous Features")
plt.show()
# Pair Plot to visualize pairwise relationships
sns.pairplot(transformed_data[continuous_features + [target_variable]], diag_kind="kde", hue=target_variable)
plt.show()
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Multivariate Analysis
# Heatmap for full dataset correlation (continuous features and encoded categorical variables)
plt.figure(figsize=(16, 12))
sns.heatmap(transformed_data.corr(), annot=False, cmap="coolwarm")
plt.title("Correlation Heatmap (All Features)")
plt.show()
# Dimensionality Reduction with PCA
# Standardizing the dataset for PCA
scaler = StandardScaler()
scaled_data = scaler.fit_transform(transformed_data)
# Applying PCA to reduce to 2 principal components for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)
# Adding PCA results back to the dataframe for plotting
transformed_data['PCA1'] = pca_result[:, 0]
transformed_data['PCA2'] = pca_result[:, 1]
# Scatter plot of the first two principal components
plt.figure(figsize=(10, 8))
sns.scatterplot(data=transformed_data, x='PCA1', y='PCA2', hue='NObeyesdad', alpha=0.7, palette="viridis")
plt.title("PCA: First Two Principal Components")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Obesity Levels")
plt.show()
# Pairwise relationships with a subset of features for clarity
pairwise_features = ['Age', 'Weight', 'FAF', 'TUE', 'NObeyesdad']
sns.pairplot(transformed_data[pairwise_features], hue='NObeyesdad', diag_kind="kde", palette="viridis")
plt.show()
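The PCA scatter plot does not report how much variance the two retained components actually capture; `explained_variance_ratio_` gives that directly. A self-contained sketch on synthetic data (a stand-in for `transformed_data`, which is not redefined here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with one strongly correlated column, so PCA has structure to find.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] * 2 + rng.normal(scale=0.1, size=200)])

# Standardize, then reduce to two components (mirroring the cell above).
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(X))

# Fraction of total variance each retained component explains.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

Reporting `ratios.sum()` alongside the scatter plot indicates how faithful the 2-D view is to the full feature space.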
5. Regression Analysis¶
Linear Regression¶
# 5.1 Linear Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Selecting a single predictor variable for simple linear regression
predictor = 'Weight' # Example predictor
target = 'NObeyesdad' # Target variable
# Split the data into training and testing sets
X = transformed_data[[predictor]]
y = transformed_data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and fit the linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Predicting on test data
y_pred = linear_model.predict(X_test)
# Calculating R2 and RMSE
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Visualizing the regression line
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.6, label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted (Regression Line)')
plt.title(f'Simple Linear Regression: {predictor} vs {target}')
plt.xlabel(predictor)
plt.ylabel(target)
plt.legend()
plt.show()
# Visualizing residuals
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals, alpha=0.6, color='purple')
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.title('Residuals of Simple Linear Regression')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
# Performance metrics
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
Root Mean Squared Error (RMSE): 1.76
R-squared (R2): 0.17
Multiple Linear Regression¶
# 5.2 Multiple Linear Regression
from statsmodels.api import OLS, add_constant
# Selecting multiple predictor variables for multiple linear regression
predictors = ['Weight', 'Height', 'Age', 'FCVC', 'CH2O', 'FAF', 'TUE']
target = 'NObeyesdad'
# Splitting the data into features (X) and target (y)
X = transformed_data[predictors]
y = transformed_data[target]
# Adding a constant term for the intercept in the regression model
X_with_constant = add_constant(X)
# Splitting into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X_with_constant, y, test_size=0.2, random_state=42)
# Fitting the multiple linear regression model using statsmodels
model = OLS(y_train, X_train).fit()
# Predicting on the test dataset
y_pred = model.predict(X_test)
# Evaluating the model
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Analyzing the significance of predictor variables (summary of the model)
model_summary = model.summary()
# Displaying the performance metrics
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
print(model_summary)
Root Mean Squared Error (RMSE): 1.68
R-squared (R2): 0.24
OLS Regression Results
==============================================================================
Dep. Variable: NObeyesdad R-squared: 0.197
Model: OLS Adj. R-squared: 0.194
Method: Least Squares F-statistic: 58.27
Date: Tue, 26 Nov 2024 Prob (F-statistic): 6.63e-75
Time: 20:02:20 Log-Likelihood: -3300.0
No. Observations: 1669 AIC: 6616.
Df Residuals: 1661 BIC: 6659.
Df Model: 7
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.2739 0.213 10.683 0.000 1.856 2.691
Weight 4.2937 0.271 15.821 0.000 3.761 4.826
Height -1.6955 0.303 -5.599 0.000 -2.289 -1.102
Age 1.9613 0.341 5.744 0.000 1.292 2.631
FCVC -0.6502 0.169 -3.837 0.000 -0.983 -0.318
CH2O 0.3615 0.148 2.436 0.015 0.070 0.653
FAF -0.3541 0.165 -2.147 0.032 -0.678 -0.031
TUE 0.0467 0.150 0.312 0.755 -0.247 0.341
==============================================================================
Omnibus: 446.705 Durbin-Watson: 1.960
Prob(Omnibus): 0.000 Jarque-Bera (JB): 223.075
Skew: 0.752 Prob(JB): 3.63e-49
Kurtosis: 2.027 Cond. No. 14.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Polynomial Regression¶
# 5.3 Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Selecting predictor and target variables for polynomial regression
predictor = 'Weight' # Example predictor for simplicity
target = 'NObeyesdad'
# Preparing the data
X = transformed_data[[predictor]]
y = transformed_data[target]
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Applying Polynomial Regression with degree 2
degree = 2
polynomial_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
polynomial_model.fit(X_train, y_train)
# Predicting on test data
y_pred_poly = polynomial_model.predict(X_test)
# Evaluating the Polynomial Regression Model
r2_poly = r2_score(y_test, y_pred_poly)
rmse_poly = np.sqrt(mean_squared_error(y_test, y_pred_poly))
# Comparing Polynomial Regression with Linear Regression
# Reusing the previously trained linear regression model for comparison
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)
r2_linear = r2_score(y_test, y_pred_linear)
rmse_linear = np.sqrt(mean_squared_error(y_test, y_pred_linear))
# Visualizing Polynomial Regression vs Linear Regression
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.6, label='Actual')
plt.scatter(X_test, y_pred_poly, color='green', alpha=0.6, label='Polynomial Predictions')
plt.plot(X_test, y_pred_linear, color='red', linewidth=2, label='Linear Regression Line')
plt.title(f'Polynomial Regression (Degree {degree}) vs Linear Regression')
plt.xlabel(predictor)
plt.ylabel(target)
plt.legend()
plt.show()
# Displaying the performance metrics
print(f"Polynomial Regression - Root Mean Squared Error (RMSE): {rmse_poly:.2f}")
print(f"Polynomial Regression - R-squared (R2): {r2_poly:.2f}")
print(f"Linear Regression - Root Mean Squared Error (RMSE): {rmse_linear:.2f}")
print(f"Linear Regression - R-squared (R2): {r2_linear:.2f}")
Polynomial Regression - Root Mean Squared Error (RMSE): 1.58
Polynomial Regression - R-squared (R2): 0.33
Linear Regression - Root Mean Squared Error (RMSE): 1.76
Linear Regression - R-squared (R2): 0.17
Logistic Regression¶
# 5.4 Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve
# Preparing the data for logistic regression
# Simplifying the task to binary classification: one obesity class vs the rest
# (under the alphabetical label encoding, class 2 is Obesity_Type_I, not Normal Weight)
transformed_data['BinaryTarget'] = transformed_data['NObeyesdad'].apply(lambda x: 1 if x == 2 else 0)
X = transformed_data[predictors] # Using multiple predictors
y = transformed_data['BinaryTarget']
# Splitting into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)
# Making the predictions
y_pred = logistic_model.predict(X_test)
y_pred_prob = logistic_model.predict_proba(X_test)[:, 1]
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)
# Displaying the Results
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
# Plotting the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Chance')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
Accuracy: 0.83
Precision: 0.50
Recall: 0.01
ROC-AUC: 0.79
Regularization Techniques¶
# 5.5 Regularization Techniques
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
# Regularization models - Ridge Regression, Lasso Regression, Elastic Net Regression
# Note: X and y carry over from the logistic regression cell above, so these
# models are fit on the binary target rather than the original obesity levels.
ridge_model = Ridge(alpha=1.0)
lasso_model = Lasso(alpha=0.01, max_iter=10000)
elastic_net_model = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)
# Performing cross-validation to evaluate each model
ridge_scores = cross_val_score(ridge_model, X, y, cv=5, scoring='r2')
lasso_scores = cross_val_score(lasso_model, X, y, cv=5, scoring='r2')
elastic_net_scores = cross_val_score(elastic_net_model, X, y, cv=5, scoring='r2')
# Training models on full training data
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)
elastic_net_model.fit(X_train, y_train)
# Predicting on the test set
ridge_pred = ridge_model.predict(X_test)
lasso_pred = lasso_model.predict(X_test)
elastic_net_pred = elastic_net_model.predict(X_test)
# Evaluating models
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_pred))
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_pred))
elastic_net_rmse = np.sqrt(mean_squared_error(y_test, elastic_net_pred))
ridge_r2 = r2_score(y_test, ridge_pred)
lasso_r2 = r2_score(y_test, lasso_pred)
elastic_net_r2 = r2_score(y_test, elastic_net_pred)
# Printing the performance comparison
print("Performance Comparison of Regularization Techniques:")
print(f"Ridge Regression: R2 = {ridge_r2:.2f}, RMSE = {ridge_rmse:.2f}, CV Mean R2 = {ridge_scores.mean():.3f}")
print(f"Lasso Regression: R2 = {lasso_r2:.2f}, RMSE = {lasso_rmse:.2f}, CV Mean R2 = {lasso_scores.mean():.3f}")
print(f"Elastic Net: R2 = {elastic_net_r2:.2f}, RMSE = {elastic_net_rmse:.2f}, CV Mean R2 = {elastic_net_scores.mean():.3f}")
Performance Comparison of Regularization Techniques:
Ridge Regression: R2 = 0.12, RMSE = 0.35, CV Mean R2 = -0.809
Lasso Regression: R2 = 0.03, RMSE = 0.37, CV Mean R2 = -1.124
Elastic Net: R2 = 0.06, RMSE = 0.36, CV Mean R2 = -1.128
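Because this cell inherits `X` and `y` from the logistic regression setup, the metrics above describe fits to the binary target, not the original obesity levels. A self-contained sketch of a like-for-like comparison of the same three regularized models on a continuous target, using `make_regression` synthetic data as a stand-in for the notebook's features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic continuous-target data standing in for the preprocessed features.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Cross-validate each regularized model with the same hyperparameters as above.
results = {}
for name, model in [("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.01, max_iter=10000)),
                    ("ElasticNet", ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000))]:
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results[name] = scores.mean()
    print(f"{name}: CV Mean R2 = {scores.mean():.3f}")
```

Re-running this pattern on the notebook's continuous `NObeyesdad` codes would make the regularized models directly comparable with the linear and polynomial fits.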
Advanced Regression Techniques¶
Quantile Regression¶
# 5.6 Quantile Regression
import statsmodels.api as sm
import numpy as np
# Selecting predictor and target for quantile regression
predictor = 'Weight' # Example predictor
target = 'NObeyesdad'
X = transformed_data[[predictor]]
y = transformed_data[target]
# Adding a constant for the intercept
X_with_const = sm.add_constant(X)
# Quantiles to predict
quantiles = [0.1, 0.5, 0.9]
# Fitting quantile regression models
quantile_models = {}
for q in quantiles:
quantile_model = sm.QuantReg(y, X_with_const).fit(q=q)
quantile_models[q] = quantile_model
print(f"Quantile {q}:")
print(quantile_model.summary())
# Plotting the regression lines for different quantiles
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.5, label='Data', color='gray')
# Predict for the quantile lines
x_range = np.linspace(X[predictor].min(), X[predictor].max(), 100)
x_range_with_const = sm.add_constant(pd.DataFrame({predictor: x_range}))
for q, quantile_model in quantile_models.items():
y_quantile = quantile_model.predict(x_range_with_const)
plt.plot(x_range, y_quantile, label=f'Quantile {q}', linewidth=2)
plt.title('Quantile Regression Lines')
plt.xlabel(predictor)
plt.ylabel(target)
plt.legend()
plt.show()
Quantile 0.1:
QuantReg Regression Results
==============================================================================
Dep. Variable: NObeyesdad Pseudo R-squared: 0.3894
Model: QuantReg Bandwidth: 0.6117
Method: Least Squares Sparsity: 2.808
Date: Tue, 26 Nov 2024 No. Observations: 2087
Time: 20:12:47 Df Residuals: 2085
Df Model: 1
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.4903 0.041 -11.913 0.000 -0.571 -0.410
Weight 5.1766 0.103 50.258 0.000 4.975 5.379
==============================================================================
Quantile 0.5:
QuantReg Regression Results
==============================================================================
Dep. Variable: NObeyesdad Pseudo R-squared: 0.1759
Model: QuantReg Bandwidth: 0.7470
Method: Least Squares Sparsity: 2.692
Date: Tue, 26 Nov 2024 No. Observations: 2087
Time: 20:12:47 Df Residuals: 2085
Df Model: 1
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3435 0.061 5.596 0.000 0.223 0.464
Weight 5.1645 0.151 34.252 0.000 4.869 5.460
==============================================================================
Quantile 0.9:
QuantReg Regression Results
==============================================================================
Dep. Variable: NObeyesdad Pseudo R-squared: 0.007249
Model: QuantReg Bandwidth: 0.6117
Method: Least Squares Sparsity: 6.071
Date: Tue, 26 Nov 2024 No. Observations: 2087
Time: 20:12:47 Df Residuals: 2085
Df Model: 1
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 6.9793 0.055 127.408 0.000 6.872 7.087
Weight -3.2741 0.107 -30.704 0.000 -3.483 -3.065
==============================================================================
*Usage:* Quantile Regression was used to predict obesity levels across different quantiles, capturing variations in relationships between predictors (e.g., weight) and the target variable. Unlike ordinary regression, it allows for modeling the impact of predictors on the distribution tails, highlighting differences in behavior for individuals with low, median, and high obesity levels. This approach provides a nuanced understanding of the predictors' influence, supporting more targeted public health interventions.
6. Model Evaluation and Comparison¶
# Model evaluation summary across all methods used
# (Note: the regularized models were evaluated on the binary target, so their
# RMSE values are on a different scale from the linear/polynomial models.)
evaluation_metrics = {
"Linear Regression": {"R2": r2_linear, "RMSE": rmse_linear},
"Polynomial Regression": {"R2": r2_poly, "RMSE": rmse_poly},
"Ridge Regression": {"R2": ridge_r2, "RMSE": ridge_rmse},
"Lasso Regression": {"R2": lasso_r2, "RMSE": lasso_rmse},
"Elastic Net Regression": {"R2": elastic_net_r2, "RMSE": elastic_net_rmse}
}
# Displaying the metrics comparison as a DataFrame for clarity
evaluation_df = pd.DataFrame(evaluation_metrics).T
# Visualizing model performance comparison
plt.figure(figsize=(12, 6))
evaluation_df['R2'].plot(kind='bar', color='skyblue', alpha=0.8, label='R2 Score')
plt.axhline(0, color='gray', linestyle='--', linewidth=1)
plt.title('R2 Score Comparison Across Models')
plt.ylabel('R2 Score')
plt.legend()
plt.show()
plt.figure(figsize=(12, 6))
evaluation_df['RMSE'].plot(kind='bar', color='salmon', alpha=0.8, label='RMSE')
plt.axhline(0, color='gray', linestyle='--', linewidth=1)
plt.title('RMSE Comparison Across Models')
plt.ylabel('Root Mean Squared Error')
plt.legend()
plt.show()
# Displaying evaluation metrics for interpretation
evaluation_df.head()
| | R2 | RMSE |
|---|---|---|
| Linear Regression | 0.172825 | 1.756740 |
| Polynomial Regression | 0.334466 | 1.575774 |
| Ridge Regression | 0.118638 | 0.350542 |
| Lasso Regression | 0.032911 | 0.367194 |
| Elastic Net Regression | 0.062974 | 0.361442 |
7. Results and Interpretation¶
Summary of Findings¶
*1. Regression Analysis:*
- Simple linear regression, using a single predictor like weight, demonstrated the influence of individual factors on obesity levels. However, the R2 value was low (about 0.17), indicating limited explanatory power.
- Multiple linear regression improved the model by incorporating a broader set of predictors (e.g., age, height, physical activity, and dietary habits), raising test R2 from 0.17 to 0.24. This highlights the importance of considering multiple factors for better prediction.
*2. Polynomial Regression:*
- Captured non-linear relationships between predictors (e.g., weight, vegetable consumption) and obesity levels. This model achieved improved R2 and reduced RMSE, indicating a better fit for the dataset's complex patterns.
*3. Regularization Techniques:*
- Ridge, Lasso, and Elastic Net regression effectively addressed multicollinearity. Ridge achieved the lowest RMSE among the regularized models, while Lasso simplified the model by shrinking less significant coefficients toward zero, balancing interpretability and accuracy.
*4. Quantile Regression:*
- Revealed insights into how predictors influence different obesity quantiles, showing varying effects of weight and physical activity on individuals with low, median, and high obesity levels.
*5. Model Comparisons:*
- Polynomial regression and regularized models emerged as top performers for predictive accuracy. Quantile regression stood out for its ability to uncover differential patterns across the obesity spectrum.
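The two top-performing ideas above, non-linear features and regularization, can also be combined in one pipeline. The following is a minimal sketch on synthetic data (the hypothetical single feature stands in for weight; the project's actual splits and columns are assumed elsewhere):

```python
# Hedged sketch: polynomial features + Ridge regularization in one pipeline
# (synthetic single-feature data stands in for the notebook's Weight column)
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(40, 160, (400, 1))                    # stand-in for Weight
y = 0.0003 * X[:, 0] ** 2 + rng.normal(0, 1.0, 400)   # non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Expand to degree-2 terms, standardize, then fit a regularized linear model
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
print(f"Test R2: {r2_score(y_te, model.predict(X_te)):.3f}")
```

Putting both steps in a pipeline keeps the feature expansion inside cross-validation, avoiding leakage when tuning the regularization strength.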
Insights¶
*1. Key Factors Driving Obesity Levels:*
- Weight: A significant predictor across all models, showing that higher weight strongly correlates with higher obesity levels. This emphasizes the need for weight management interventions.
- Physical Activity (FAF): Higher levels of physical activity are linked to lower obesity levels. Encouraging regular exercise can be an effective strategy for preventing and managing obesity.
- Dietary Habits (FCVC & CH2O): Increased vegetable consumption and adequate water intake are associated with healthier weight categories, reinforcing the importance of balanced diets.
*2. Non-Linear Relationships:*
- Polynomial regression revealed that the relationship between predictors (like weight) and obesity levels is not strictly linear. For instance, small increases in weight can lead to much larger changes in obesity levels for individuals in higher categories. This indicates that interventions may need to scale differently for individuals at higher obesity levels, focusing more aggressively on diet and activity.
*3. Segment-Specific Trends from Quantile Regression:*
- Individuals in the highest obesity categories (e.g., Obesity Type III) are more affected by reduced physical activity and excessive weight compared to those in normal or overweight categories. These findings suggest that interventions for extreme cases should prioritize increasing activity levels and reducing calorie intake.
*4. Combined Effects of Predictors:*
- Multiple linear regression showed that no single factor fully explains obesity levels. Instead, a combination of demographic (age, gender), dietary (vegetable and water intake), and lifestyle factors (activity levels, screen time) provides the best predictive accuracy. This highlights the importance of a holistic approach to obesity management.
*5. Implications for Public Health:*
- Strategies like promoting healthier eating habits (e.g., increased vegetable intake) and active lifestyles can significantly impact obesity prevention at the population level. For individuals with severe obesity, targeted interventions like structured exercise programs and caloric restrictions may yield the greatest benefits.
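One way to rank the key factors discussed above is to compare coefficients after standardizing all predictors, since Lasso coefficients on a common scale are directly comparable. This sketch uses synthetic stand-ins for the dataset's `Weight`, `FAF`, `FCVC`, and `CH2O` columns, with assumed effect sizes, purely for illustration:

```python
# Hedged sketch: ranking predictors by the magnitude of standardized Lasso
# coefficients (synthetic stand-ins for Weight, FAF, FCVC, CH2O)
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Weight": rng.uniform(40, 160, 500),
    "FAF": rng.uniform(0, 3, 500),    # physical activity frequency
    "FCVC": rng.uniform(1, 3, 500),   # vegetable consumption
    "CH2O": rng.uniform(1, 3, 500),   # water intake
})
# Assumed effect sizes for illustration only
y = 0.05 * df["Weight"] - 0.8 * df["FAF"] - 0.3 * df["FCVC"] + rng.normal(0, 0.5, 500)

# Standardize so coefficient magnitudes are comparable across predictors
X_std = StandardScaler().fit_transform(df)
lasso = Lasso(alpha=0.05).fit(X_std, y)
importance = pd.Series(lasso.coef_, index=df.columns).abs().sort_values(ascending=False)
print(importance)  # larger |coefficient| => stronger standardized effect
```

In this synthetic setup, weight dominates and the irrelevant water-intake stand-in is shrunk toward zero, mirroring how Lasso was used in the project to separate key drivers from weak ones.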
Limitations¶
*1. Dataset Scope:*
- The dataset includes only individuals from Mexico, Peru, and Colombia, potentially limiting the findings' applicability to other populations with different demographics or cultural habits.
*2. Predictor Coverage:*
- Key factors such as genetic predisposition, stress, socio-economic conditions, and psychological factors were not captured in the dataset, which might restrict the model's explanatory power.
*3. Model Assumptions:*
- Models like linear regression assume linearity, while even polynomial regression assumes fixed patterns. Real-world obesity factors may involve more dynamic or unmeasured interactions.
*4. Quantile Regression Complexity:*
- While quantile regression provides valuable segment-specific insights, interpreting its results across multiple quantiles and translating these into actionable recommendations can be challenging.
8. References¶
- Estimation of Obesity Levels Based On Eating Habits and Physical Condition dataset - https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
- Dr. Yoonsung Jung’s Notes • STAT 650 Course Notes - https://canvas.tamu.edu/
- Pandas documentation - https://pandas.pydata.org/docs/
- Scikit-Learn documentation - https://scikit-learn.org/stable/
- Seaborn documentation - https://seaborn.pydata.org/